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Abstract 

Admixed populations are formed by the merging of two or more ancestral 
populations, and the ancestry of each locus in an admixed genome derives from 
either source. Consider a simple “pulse” admixture model, where populations 
A and B merged t generations ago without subsequent gene flow. We derive the 
distribution of the proportion of an admixed chromosome that has A (or B) an¬ 
cestry, as a function of the chromosome length L, t, and the initial contribution 
of the A source, m. We demonstrate that these results can be used for inference 
of the admixture parameters. For more complex admixture models, we derive 
an expression in Laplace space for the distribution of ancestry proportions that 
depends on having the distribution of the lengths of segments of each ancestry. 
We obtain explicit results for the special case of a “two-wave” admixture model, 
where population A contributed additional migrants in one of the generations 
between the present and the initial admixture event. Specifically, we derive for¬ 
mulas for the distribution of A and B segment lengths and numerical results for 
the distribution of ancestry proportions. We show that for recent admixture, 
data generated under a two-wave model can hardly be distinguished from that 
generated under a pulse model. 


1. Introduction 


Present-day genomes are mosaics of ancestries from the different sources 


that merged to form modern populations (e.g., Price et al. (20091; Brisbin et al. 


(2012); Maples et al. (20131). Estimating the time of each admixture event and 
the relative contribution of each source population is an important problem in 


population genetics. To estimate admixture times 

, Johnson et al. ( 

20111 fit- 

ted the number of ancestry switches; 

Pugach et al. 

(2011 

), and later Sanderson 


et al. (2015), matched simulations to the typical segment length, as estimated 
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from a wavelet transform of the local ancestry along the genome; and Pool 


and Nielsen (2009), as well as Gravel (2012), fitted the distribution of seg¬ 


ment lengths. However, these methods require an accurate identification of the 
boundaries of admixture segments, which is not always available, in particular 


for computationally phased data. Reich and colleagues ( 

Moorjani et al. 

2011 

Patterson et al. 

2012 

Loh et al.| 2013| Pickrell et al. 

2014 

) fitted the decay of 


admixture linkage disequilibrium (LD) with genetic distance, but such methods 


can be confounded by background LD. Hellenthal et al. (2014) recently pro¬ 


posed a promising approach based on the probability of two fixed loci to have 
given ancestries. Gene flow parameters can also be inferred using more general 
demographic inference methods, e.g., based on the allele frequency spectrum 


(ExcofSer et al. 


2013 Gutenkunst et al. 2009) or segment sharing (Palamara 


and Pe’er 2013); however, to use these methods one must specify and infer a 


model for the entire history. 


Recently, Rosenberg and colleagues (Verdu and Rosenberg 2011 Goldberg 


et al. 

2014 

Goldberg and Rosenberg 

2015 

1, 

Liang and Nielsen 

(2014a 

), and 

Gravel 

(2012 

) derived analytical results for the moments of the ancestry pro- 


portion, namely, the fraction of a chromosome that descend from a given source 


population. These ancestry proportions can be reliably inferred (e.g., Alexan- 


der et al. 

(2009) 

Pritchard et al. 

(2000 

)), and the derived moments have been 

used for admixture time inference (e.g., 

Botigue et al. 

(2013) 

Liang and Nielsen 


(2014b); Moreno-Mayar et al. (2014)). However, these methods did not make 
use of the entire distribution, as no analytical results were available. Here, we 
derive first results for the distribution of the ancestry proportions. We obtain 
an explicit formula for the case of a “pulse” admixture event, and we demon¬ 
strate its application to the inference of the admixture parameters. We then 
derive an expression in Laplace space for a general admixture model of arbi¬ 
trary complexity, but which requires knowledge of the distribution of admixture 
segment lengths. For the special case of a “two-wave” admixture, we obtained 
the segment length distribution, and consequently, through numerical Laplace 
inversion, the distribution of ancestry proportions. 


2. The distribution of ancestry proportions under pulse admixture 

Consider the pulse model, where an admixed population formed t generations 
ago as a result of merging of populations A and B, and where the proportions 
of ancestry contributed by A and B were m and 1 — to, respectively. Under 
this model, each locus in a chromosome of a present-day admixed individual 
can trace its origin to A or B with probabilities to and 1 — to, respectively. 
We assume lineages break apart along the chromosome, due to recombination, 
at rate t per Morgan. Ignoring genetic drift and the underlying pedigree, we 
assume that upon recombination, the new source population is selected at ran¬ 
dom. Therefore, a recombination event will lead to a change of ancestry from 
A to B with probability 1 — to and from B to A with probability to. The 
lengths chromosomal segments with A ancestry will thus be exponentially dis¬ 


tributed with rate (1 —to)<, and similarly for the B segments (rate mt) (Gravel 
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20121. We neglect the first generation after admixture, where A and B seg¬ 
ments do not yet mix ( Gravelf 20121. As pointed out by Liang and Nielsen 
(2014aI, the assumption of independent and exponentially distributed segment 
lengths breaks down for very short and very long times since admixture, due 
to the effect of the underlying pedigree and the accumulation of genetic drift, 
respectively. Nevertheless, for the relevant time-scales and effective population 
sizes of many human populations (around 10-200 generations and effective size 
in the thousands), segment lengths should be extremely well approximated by 
independent exponentials. 

Given a chromosome of length L (Morgans), the ancestry along the chromo¬ 
some can be modeled as a two-state process with states A and B, and with the 
distribution of segment lengths in each state given above. We are interested in 
the distribution of x, the fraction of the chromosome in state A. Adopting a 


result of Stam (1980), the desired distribution is given by 


/(x; L) = {1- m)e-'"^(5(x) -b me-(i-'")^(5(l - x) 
-b m(l - x 

mx -b (1 — m)(l — x) 


( 1 ) 


h{2ha) + 2Io{2ha) 


where h = tL, a = -yrri^W^rnyx^W^a)), and Iq and Ii are the modified Bessel 
functions of the hrst kind of order 0 and 1, respectively. Note the delta functions 
at X = 0 and x = 1, corresponding to the probability of the entire chromosome 
having B only or A only ancestry, respectively. The mean ancestry proportion 
satishes (x) = m, as expected. The variance is given by 


Var I 


2to( 1 — m) ^ _fi 






( 2 ) 


in agreement with Eq. (A16) in Gravel (2012). Note that Eq. ([^ represents 


the distribution of the ancestry proportion across repetitions of the ancestral 
process (for a single chromosome), rather than across chromosomes from a sam¬ 
ple (Gravel 2012[ Liang and Nielsen 2014b). However, unless the population is 
very small (compared to the admixture time and the sample size), histories of 
different chromosomes are to a good approximation independent, and the two 
distributions are the same. 


3. Inference of admixture times 

In theory, given the observed ancestry proportion for each chromosome in 
a sample, Eq. 0 can be used to compute the likelihood of the observed data 
for given admixture parameters. In practice, in the absence of trios or pedigree 
information, phase switches are abundant, and hence, it is difficult to accurately 
determine the ancestry proportion per chromosome. However, it is still possi¬ 
ble to determine the diploid ancestry proportion, y = (xi -b X2)/2. Given that 
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Figure 1: Inference of admixture times using the distribution of ancestry proportions. We 
simulated an admixture pulse history under the Markovian Wright-Fisher model of |GraveI| 
( 2012[ |. The model assumes that the 2N haploid chromosomes in the current generation are 
formed by following a Markovian path within the 2N chromosomes of the previous generation. 
Ancestry changes occur as a Poisson process with rate 1 (per Morgan). Each chromosome 
in the first generation is assigned to population A or B with probabilities m and 1 — m, 
respectively, and the evolution of the chromosomes is traced for t generations. We used m = 
0.5, L = 2M, and N = 2500, and varied t. Ancestry proportions from pairs of chromosomes 
were averaged to generate diploid individuals. We then set the inferred m to the mean A 
ancestry, and used the distribution of ancestry proportions, Eq. Q, to infer the admixture 
time t. Each dot in the plot shows the inferred time, t, for one simulation. The dotted red 
line corresponds to t = t, and the dashed purple line to the mean inferred time, (£). 


homologous chromosomes have independent histories, the diploid ancestry pro- 
portion distribution, fj^{y\L), can be computed from Eq. 0 by convolution. 
Suppose we are now given the diploid ancestry proportions yij for individuals 
i = and for chromosomes j = 1,...,22 (where each chromosome has 

length Lj). Assuming chromosomes are independent both within and between 
individuals, the likelihood of the data is given by 

n 22 

likelihood = (3) 

i=i 


Maximum likelihood estimates (MLE) for m and t can then be obtained by 
a simple grid search. Simulation results with perfect knowledge of segment 
boundaries demonstrated that the method can infer correctly both m and t 
(Figure [^, although the variance increases with t, as expected. Coalescent 
simulations followed by inference of ancestry proportions using ADMIXTURE 
(Alexander et al. 2009) and application of our method demonstrated again high 
accuracy, at least as long as the A and B population were sufficiently diverged 
(not shown). However, when A and B are closely related, the distributions of 
the true and inferred ancestry proportions may differ, affecting the accuracy of 
the method. 
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4. The distribution of ancestry proportions under a general distribu¬ 
tion of segment lengths 

We have so far considered a simple pulse model, under which the distri¬ 
butions of segment lengths are exponential with known rates. Under a more 
complex admixture history, we assume that the distributions of the lengths of 
A and B segments take the general form (7 a(^) and We still assume 

that A and B segments are independent (see below). The process can then be 
modeled as a two-state process. We start on the left end of the chromosome in 
state A or B with probabilities pA = {^a) / ((^a) + (^s)) and 1 — pA, respec¬ 
tively (where {£a) and {is) are the mean segment lengths), and draw a random 
segment length from the selected ancestry. When the first segment terminates, 
we switch ancestries and draw a segment length from the other ancestry, and 
so on until we reach the end of the chromosome. 

The distribution of x, the A ancestry proportion, can be computed in 
Laplace space by extending renewal theory methods developed in the physics 


domain (e.g., Godreche and Luck (2001); Margolin and Barkai (2004)). Let s be 
the Laplace pair of L (the total chromosome length) and u as the Laplace pair of 
La = xL (the total chromosome length covered by A segments). We transform 
the density f{LA]L) (from which the density of x can be easily obtained) to 
f{u; s), and after some calculations, we obtain 

?/ . N S [1 - QAjs + u)fe(s)] + M [1 - fe(s)] {1 - PA [1 - qAjs + m)]} , . 

s{s + u) [1 - qA{s + u)q-B{s)] 

In the above equation, qA^s) and 9 b( s) are the Laplace transforms {£ -A s) 
of 9a (^) and 9b (f'), respectively. The details of the derivation are somewhat 
tedious and therefore omitted. It can be shown, using Eq. (|^, that the mean 
ancestry proportion (x) approaches pA as L —)■ oo. It can be also shown that 
Eq. reduces to Eq. 0 for the admixture pulse model. 


5. Conditions under which consecutive segments are independent 

To obtain concrete results for complex admixture histories, we use the model 
developed by Gravel ( |2012 ) (section General incoming migration in the absence 
of drift and Eigure 3 there). Gravel proposed that the ancestry along the chro¬ 
mosome could be described by a Markov process, whose states correspond to 
the identity of the source population (i.e., A or B), combined with the time 
when each segment entered the admixed population. Gravel then derived the 
transition rates for any admixture history. While the extended state space pro¬ 
cess is Markovian under any history, consecutive A and B segment lengths are 
generally no longer independent. However, little thought reveals that as long as 
migration beyond the the initial event is limited to one population, consecutive 
segment lengths remain independent. 
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6. The distribution of segment lengths under a two-wave admixture 
model 

Consider a model where populations A and B have merged ti generations 
ago, contributing proportions m and 1 —m to the admixed population. Then, t 2 
(< ti) generations ago, migrants from population A have replaced a proportion 
fi of the gene pool of the admixed population. No other events then take place 
until the present. The corresponding Markov process, using the method of 
Gravel (2012), has three states: Ai, A 2 , and B, representing migrant segments 


from A at time ti, from A at time t 2 , and from B (at time ti), respectively. 
Let us compute the distributions of the lengths of A and B segments. 

The transition rate is ti when at states Ai and B, and t 2 when at A 2 . It 
can be shown that once a transition is made, the next state is chosen according 
to the following transition probability matrix 


(l-m) 

to( 1 —/i) /r (1 —m)(l —/r) 
(1 _ m) (1 - 


( 5 ) 


The states are ordered as (Ai, A 2 , B) and Vij {i,j = 1, 2, 3) is the probability to 
jump from state i to state j. Note again that we neglected the first generation 


after admixture, during which A and B segments do not yet mix (Gravel 2012). 


It is now easy to see that B segment lengths are distributed exponentially 
with rate ti(l — Ps.s)) or 


9b (^) = ti 


1- (l-m) ( 1-^^ 


exp < —t 


1 - (1 - to ) ( 1 - 


( 6 ) 

This equation was also (implicitly) derived by Ni et al. (2015) in a different 


way. For the A segments, define qai {(■) as the distribution of A segment lengths, 
when the process entered the A states at state Ai, and similarly for qA^ {€}. Since 
the process always enters Ai and A 2 from B (ignoring the leftmost end of the 
chromosome), the distribution of A segments therefore satisfies 


9a(^) = 


B,At 


1 - P 




-9Ai(^) 


B,A2 


1 - p 


B,B 


AaAx)- 


( 7 ) 


To find qAii^) and qA 2 {^)j we write integral equations. 


qaA^) = PAi,Btie"‘"^ + / tie~*^y [PAi,AiqAiA - y)dy + PA,,A2qA2A - y)]dy 

Jo 

pi 

qA2ix) = PA2,Bt2e~*^'^ + / [Pa2,Ai9Ai (^ - J/)d2/+ PA 2 .A 29 A 2 (^ - y)] 

Jo 

( 8 ) 


We solve these equations by using a Laplace transform {£ -A s) and the convo- 
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lution theorem, 


'?Ai(s) — 1 [P Ai,B + PAi,Ai9Ai(s) + PAi,A2 9A2(s)] 

tl + S 

9A2(s) = [Pa2,-B + Pa2,Ai9Ai(s) +Pa2,A29A2(s)] ■ 

t 2 + s 


( 9 ) 


These are two linear equations in two variables (qAiis) and which are 

easily solved. Then, qAi(^) and qA 2 {^) are obtained by Laplace transform in¬ 
version. We then use Eq. 0 to obtain q^i^)- We carried out these steps in 
Mathematica, leading to the result 


, , _ (1 - ''n)e [Cl sinh(/3£/2) -|- C2 cosh(/3£/2)] 

/3 [mti-I-7^2(1 ~ nr)] ’ 

where 7 = ti -|- (1 - m){ti - t 2 n), /3 = - 4 ti< 2 (l - m){l - fi), 

Cl = m^{ti-nt2f’-m{ti-^t2) [t\ - tit2 - [ti 


( 10 ) 


^2(1 - m )] , 


and 


C 2 = [m{ti - ^<2)^ -I- 7(1 - j3. 


We note that we can also view the second migration wave as gene flow coming 
from a third population. Our results then automatically provide the distribution 
of ancestry proportions coming from each of the three sources. 

We ran simulations of the two-wave model under the Markovian Wright- 
Fisher framework described by Gravel] (2012| (see Figure]^. Representative 




simulation results are shown in Figure^ It can be seen that our theory matches 
the empirical data very well. We note, though, that the empirical distributions 
can be htted quite well to an admixture pulse model with parameters rUpuise 
equal to the expected mean (/x -|- to (1 — p,)) and tpuise intermediate between ti 
and t 2 - This suggests that, at least for admixture parameters tested here, any 
inference based on the more complex model may not have sufficient evidence to 
justify the additional gene flow event (see also Hellenthal et al. (2014)). 


7 . The distribution of ancestry proportions for two-wave admixture 

Now that we have q^ {£) and qs {i) for the two-wave model (Eqs. (10) and (j^ , 
respectively), we can use Eq. (j^ for the distribution of the ancestry proportions. 
We inverted f{u; s) with respect to u using Mathematica and then numerically 
with respect to s, to obtain f(x;L). Simulation results are shown in Figure]^ 
demonstrating that our theoretical results fit the empirical data very well. Here 
too, excellent fit is achieved also by the pulse admixture model. 
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L=50(M),t^=6,t2=4,m=0.5,//=0.2 



Segment length i (M) 

Figure 2: Two-wave admixture: simulations and theory for the segment length distribution. 
We simulated a two-wave admixture model according to a Markovian Wright-Fisher model 
l |Gravel| ( |2012[ l; as in Figure Q with N = 2500. The other model parameters are indicated 
on top of the figure. We used a particularly long chromosome to avoid boundary effects. We 
recorded the lengths of segments that descend from A and B populations, and plotted their 
histog ram (circles and squares, respectively). The theoretical distributions, gA(^) a.nd 
(Eqs. ( |10[ | and §, respectively), are plotted as solid lines. We then fitted a pulse admixture 
model with just two parameters (m and t) by matching the means of the empirical A and 
B segment lengths. The distributions of A and B segment lengths under the pulse model 
(exponentials with rates (1 — m)t and mt, respectively) are plotted as dashed light-colored 
lines. The best fit for t was 5.7, intermediate between t\ and t 2 . 


8. Discussion 

We proposed a simple method for inference of admixture parameters based 
on the empirical distribution of the proportion of each chromosome that descend 
from each source population. One advantage of this approach is not having to 
rely on the precise boundaries of the admixture segments. Rather, we only need 
the total amount of genetic material from each source, which is typically easier 
to estimate, even without explicitly performing local ancestry inference (e.g., 
using programs such as ADMIXTURE). Additionally, our method is easily 
adapted to unphased data, and enjoys the advantages of maximum likelihood 
estimation. Finally, we were able to completely generalize the results to the case 
of two-wave admixture, where gene flow from one of the populations occurred 
in two different occasions. 

Extending the results to additional gene flow events (limited to a single 
source population) should be straightforward, at least numerically. However, 
extensions to more complex models or to continuous migration seem compli¬ 
cated. The finite population size could generally be neglected, as long as it is 
much larger than the sample size and (numerically) than the number of genera¬ 
tions since admixture, which is typically the case. For small populations, genetic 
drift has two contrasting effects. The first is to increase the variance of the distri¬ 
bution of the ancestry proportions, since the potential for a “back-coalescence” 
implies that recombination events do not always change the ancestry, effectively 
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Figure 3: Two-wave admixture: simulations and theory for the ancestry proportions. We sim- 
ulated a two-wave admixture model according to a Markovian Wright-Fisher model (||Gravel| 
|2012| |; as in Figure^ but without averaging pairs of haplotypes) with N = 2500. The other 
model parameters are indicated on top of the figure. We recorded the fraction of each chromo¬ 
some that descends from the A population, and plotted the histogram (circles). The theory 
(based on Eq. 0) is plotted as a solid (blue) line. We then fitted a pulse admixture model 
with just two parameters (m and t) by matching the mean and variance of the empirical data. 
The distribution of the ancestry proportions under the pulse model (Eq. 0 ) is plotted as a 
dashed (purple) line. The best fit for t was 9.7, intermediate between ti and t 2 - 
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increasing segment lengths. Based on the analysis of the SMC’ model by Liang| 
and Nielsenj (2014a), Eq. ([^ should still hold, but with h = tL replaced by 
h = tL, where t = 2N (l — (derivation not shown; note that t ^ t 

for t <C 2N). The second effect arises when the distribution is over a sample 
from the population, reducing the variance due to the fact that lineages may 
coalesce already before reaching the time of admixture. A complete theory is 
yet to be developed, perhaps along the lines of Liang and Nielsen (2014bI. 

While our results provide reasonable accuracy for a parameter regime typical 
of some natural populations, we caution that often, the method may not be di¬ 
rectly applicable. Intuitively, the information exploited by our method is mostly 
in the first moments of the distribution; while we estimate parameters in a more 
principled MLE approach, our method is still prone to inaccuracies in estimating 


ancestry proportions, as in studies based merely on the variance (e.g., (Botigue 


et al. 2013) and our unpublished results for Ashkenazi Jews). Nevertheless, 


our method may serve as a building block (or a sanity check) for more complex 
approaches. For example, our results already clearly demonstrate the inherent 
infeasibility of distinguishing a pulse and a two-wave admixture histories using 
segment lengths and ancestry proportions statistics for recent admixture (past 
« 10 generations). Finally, our theoretical results will be of great interest to 
researchers in population genetics and coalescent theory working in the very 
active field of admixture modeling. 


Software 

Matlab code is available for the inference of m and t for the pulse admixture 
model, as well as for the distributions of segment lengths and ancestry propor¬ 
tions for the two-wave model. See https://github.com/scarmi/admixture_models. 
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